RAG Workshop - Ingestion

The ingestion process involves chunking the documents, converting the chunks to embeddings, and storing them in a vector DB.

Estimated time: 60 minutes

by Kevin Lin, Andreas Mueller


Steps to Ingest Documents

Prep.: Create Index in AI Search (Vector DB)

  1. Load documents from local storage or Azure Blob Storage.
  2. Chunk the document(s) into manageable parts based on tokens.
  3. Create embeddings for the document chunks.
  4. Upload the chunks with their embeddings to AI Search (Vector DB).

Why do we need chunking?

  • Chunking is the process of breaking documents into smaller, meaningful parts.
  • These parts are small enough for efficient embedding and retrieval.
  • Chunking ensures that our model can access relevant context without being overwhelmed by too much data.

What are Embeddings?

> An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
> -- OpenAI


Vector relatedness

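Relatedness is typically scored with a distance or similarity measure. As a small sketch (numpy assumed available, example vectors made up for illustration; real embeddings have 1536 dimensions), cosine similarity is one common way to score relatedness in code:

# Score the relatedness of two embedding vectors with cosine similarity.
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    av, bv = np.asarray(a), np.asarray(b)
    # 1.0 = same direction (highly related), values near 0 = unrelated
    return float(np.dot(av, bv) / (np.linalg.norm(av) * np.linalg.norm(bv)))

print(cosine_similarity([0.1, 0.9], [0.2, 0.8]))   # ~0.99: related
print(cosine_similarity([0.1, 0.9], [0.9, -0.1]))  # 0.0: unrelated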

What Is a Vector Database?

  • A Vector Database stores embeddings for efficient search.
  • It helps quickly find relevant document chunks based on their semantic meaning.
  • Common options include Pinecone, Weaviate, and Azure AI Search.
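For context, this is roughly what "find relevant chunks based on their semantic meaning" looks like as a vector query in Azure AI Search. This is a hedged sketch only (retrieval itself is covered later in the workshop); the environment variable names are assumptions, and the embedding/content fields follow the index built in the hands-on steps below:

# Find the chunks whose embeddings are closest to a query embedding.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint=os.environ["AI_SEARCH_ENDPOINT"],  # assumed variable names
    index_name=os.environ["AI_SEARCH_INDEX_NAME"],
    credential=AzureKeyCredential(os.environ["AI_SEARCH_API_KEY"]),
)

async def find_similar_chunks(query_embedding: list[float], top: int = 3) -> list[str]:
    vector_query = VectorizedQuery(
        vector=query_embedding, k_nearest_neighbors=top, fields="embedding"
    )
    results = await search_client.search(vector_queries=[vector_query])
    return [doc["content"] async for doc in results]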

Ingestion: Hands-On tasks


Step 1: Create Vector DB Index

Task

  • Set up a new index in the shared AI Search to store document chunks with embeddings

Expected Outcome

  • Go to the AI Search instance srch-swex-rag-workshop - Indexes and verify that your index is created

  • Click on your index, select the field, and verify the field configuration (the embedding field needs to be retrievable)

Sample Code #1

# Imports from the Azure Search Documents library
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    SearchableField,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

# Define the index fields
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="filename", type=SearchFieldDataType.String, filterable=True, sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchField(
        name="embedding",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        hidden=False,
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="default"  # must match the profile name in vector_search below
    ),
]

vector_search = VectorSearch(
    profiles=[VectorSearchProfile(name="default", algorithm_configuration_name="default")],
    algorithms=[HnswAlgorithmConfiguration(name="default")]
)

index = SearchIndex(
    name=AI_SEARCH_INDEX_NAME,
    fields=fields,
    vector_search=vector_search
)

# Create the index
search_index = await ai_search_index_client.create_index(index)

Step 1: Create Vector DB Index - Hints

  • Use the index POST endpoint to trigger index creation
  • Use components from the Azure Search Documents library
  • Adapt AI_SEARCH_INDEX_NAME in the .env file to your name, using the pattern: [firstname]_[lastname]_rag_workshop
  • Make sure your index contains at least separate fields for id, filename, content, and embedding

Sample document representation

{
  "id": "0d1f45b1-c19a-4254-9390-5bd8a3b94c95",
  "filename": "document.pdf",
  "content": "This is the content of the document",
  "embedding": [
    0.00838001,
    ...
  ]
}

Step 2: Loading Documents

Task

  • Use the insurance documents from Wiki - Workshop Document - Input Document as input
  • Load the documents from a local file path.

Expected Outcome

  • The full text content of the PDF file is extracted as a string for further processing

Step 2: Loading Documents - Hints

  • Use the pypdf library to extract the text content
  • Consider writing a test to verify that the content of the document is loaded correctly

Helper Prompt

Implement a Python function that loads a local document and extracts the text content. 
Use the library: pypdf~=5.0.0 for PDF files.
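A minimal sketch of such a loader, assuming the input is a PDF on the local file system:

# Extract the full text of a local PDF as a single string.
from pypdf import PdfReader

def load_pdf_text(path: str) -> str:
    reader = PdfReader(path)
    # Join the text of all pages; extract_text() can return None for empty pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)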

Step 3: Chunking Documents

Task

  • Split documents into smaller chunks based on token limits
  • Create token-based chunks to ensure manageable sizes for embedding and retrieval

Expected Outcome

  • A list of chunks represented as a list of strings

Step 3: Chunking Documents - Hints

  • There are many different chunking algorithms out there. Focus on a simple, token-based one.
  • You can either write your own or use the provided nltk library (a sketch follows the helper prompt below).
  • Don't waste too much time on finding the right one; you can fine-tune it later.

Helper Prompt

Implement a Python function to chunk the document content into parts based on tokens (e.g., 500 tokens per 
chunk). Ensure each chunk is self-contained without breaking sentences.
Use the library: nltk~=3.9.1
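A minimal sketch along those lines, assuming nltk word tokens as the token measure (a model tokenizer counts slightly differently, but this is close enough for a first pass):

# Group whole sentences into chunks of at most max_tokens nltk word tokens.
import nltk

nltk.download("punkt_tab", quiet=True)  # sentence tokenizer data, needed once

def chunk_text(text: str, max_tokens: int = 500) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for sentence in nltk.sent_tokenize(text):
        sentence_tokens = len(nltk.word_tokenize(sentence))
        # Start a new chunk if adding this sentence would exceed the budget,
        # so sentences are never broken across chunks.
        if current and current_tokens + sentence_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += sentence_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks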

Step 4: Create Embedding(s)

Task

  • Use a deployed OpenAI model to create embeddings for an individual chunk or a list of chunks

Expected Outcome

  • An embedding representation (list of floats) for each chunk

Step 4: Create Embedding(s) - Hints

  • Use an embedding client of Azure OpenAI to generate the embeddings
  • There are two different model deployments available in the env. config. Make sure to use the embedding model for this step.

Helper Prompt

Implement a Python function to generate embeddings for individual or multiple chunks using an Azure OpenAI 
model.
Use the library: openai~=1.46.1 or azure-ai-inference==1.0.0b5 
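A minimal sketch using the openai library's async Azure client; the environment variable names and API version here are assumptions, so adapt them to the workshop's env. config:

# Generate one embedding (list of floats) per chunk via Azure OpenAI.
import os
from openai import AsyncAzureOpenAI

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # assumed variable names
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

async def embed_chunks(chunks: list[str]) -> list[list[float]]:
    response = await client.embeddings.create(
        model=os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"],  # the embedding model, not the chat model
        input=chunks,
    )
    # Results come back in the same order as the input chunks.
    return [item.embedding for item in response.data]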

Step 5: Upload documents to Vector DB

Task

  • Combine chunk contents with their embeddings into document representations
  • Upload these documents into your Azure AI Search DB (Vector DB)

Expected Outcome

  • Your vector DB index is filled with your chunked documents
  • Go to srch-swex-rag-workshop - Indexes and search for your index
  • Click on your index, then click Search in the Search explorer view
  • Verify that the index contains a list of documents

Step 5: Upload documents to Vector DB - Hints

  • Use the Azure Search Documents client to upload your documents
  • Remember the index you created earlier? The document structure needs to match the index fields exactly!

Helper Prompt

Help me write a Python function that uploads objects containing a text chunk and its embedding to an 
Azure AI Search index. 
Use the library: azure-search-documents~=11.5.1
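A minimal sketch, assuming the field names from Sample Code #1 (id, filename, content, embedding) and, as before, assumed environment variable names:

# Upload one document per chunk; field names must match the index exactly.
import os
import uuid
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.aio import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AI_SEARCH_ENDPOINT"],  # assumed variable names
    index_name=os.environ["AI_SEARCH_INDEX_NAME"],
    credential=AzureKeyCredential(os.environ["AI_SEARCH_API_KEY"]),
)

async def upload_chunks(filename: str, chunks: list[str], embeddings: list[list[float]]) -> None:
    documents = [
        {
            "id": str(uuid.uuid4()),
            "filename": filename,
            "content": chunk,
            "embedding": embedding,
        }
        for chunk, embedding in zip(chunks, embeddings)
    ]
    await search_client.upload_documents(documents=documents)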